library(tidyverse) # for graphing and data cleaning
library(tidymodels) # for modeling
library(themis) # for up- and downsampling recipe steps
library(doParallel) # for parallel processing
library(stacks) # for stacking models
library(naniar) # for examining missing values (NAs)
library(lubridate) # for date manipulation
library(moderndive) # for King County housing data
library(vip) # for variable importance plots
library(patchwork) # for combining plots nicely
library(ranger) # random forest engine
library(xgboost) # boosted tree engine
theme_set(theme_minimal()) # Lisa's favorite theme
data("lending_club")
# Data dictionary (as close as I could find): https://www.kaggle.com/wordsforthewise/lending-club/discussion/170691
We’ll be using the lending_club dataset from the modeldata package, which is part of tidymodels. The data dictionary that the dataset’s documentation references doesn’t seem to exist anymore, but the one in the Kaggle discussion linked in the code comment above seems to be pretty close. It might also help to read a bit about Lending Club before starting on the exercises.
The outcome we are interested in predicting is Class. According to the dataset’s help page, its values are “either ‘good’ (meaning that the loan was fully paid back or currently on-time) or ‘bad’ (charged off, defaulted, or 21-120 days late)”.
This dataset has 23 variables. We are going to look at the distribution of quantitative and categorical variables.
lending_club %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(vars(variable),
             scales = "free")
The histograms above show that many of the variables are right-skewed, including annual_inc, inq_last_12m, num_il_tl, open_il_24m, open_il_6m, total_bal_il, and total_il_high_credit_limit.
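A quick numeric check can back up what the histograms suggest. The sketch below is not part of the original analysis: it compares the mean and median of each numeric variable, on the rule of thumb that a mean well above the median hints at right skew. The names skew_check, mean_value, median_value, and mean_above_median are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(modeldata)

data("lending_club")

# For each numeric variable, compare the mean to the median.
# A mean noticeably larger than the median is a rough sign of right skew.
skew_check <- lending_club %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "value") %>%
  group_by(variable) %>%
  summarize(mean_value = mean(value, na.rm = TRUE),
            median_value = median(value, na.rm = TRUE),
            .groups = "drop") %>%
  mutate(mean_above_median = mean_value > median_value)

skew_check
```

This is only a heuristic, so it is best read alongside the histograms rather than instead of them.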
lending_club %>%
  select(where(is.factor)) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_bar() +
  facet_wrap(vars(variable),
             scales = "free",
             nrow = 2)
We can see that there are 6 categorical variables in the dataset. These variables look fairly well distributed across their levels, with one exception: for Class, most of the observations are in the “good” category.
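To put a number on the imbalance in Class, one option is to count the observations in each category and turn the counts into proportions. This is a small sketch rather than part of the original write-up, and class_counts and prop are illustrative names.

```r
library(dplyr)
library(modeldata)

data("lending_club")

# Count the observations in each outcome class and compute each class's share
class_counts <- lending_club %>%
  count(Class) %>%
  mutate(prop = n / sum(n))

class_counts
```

The prop column shows how small the “bad” share is, which is the motivation for the up- and downsampling steps used later.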
set.seed(494) # for reproducibility
# stratify on Class so the good/bad proportions are similar in the training and test sets
lending_split <- initial_split(lending_club, strata = Class,
                               prop = .75)
lending_training <- training(lending_split)
lending_test <- testing(lending_split)
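One way to confirm that the stratified split did its job is to compare the class proportions in the two pieces. The sketch below repeats the split so that it runs on its own; split_props and the set column are names chosen for illustration.

```r
library(tidymodels)

data("lending_club")

set.seed(494)
lending_split <- initial_split(lending_club, strata = Class, prop = .75)

# Class counts and proportions within the training and test sets
split_props <- bind_rows(
  training(lending_split) %>% count(Class) %>% mutate(set = "training"),
  testing(lending_split) %>% count(Class) %>% mutate(set = "test")
) %>%
  group_by(set) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

split_props
```

If the stratification worked, the prop values for each class should be close across the two sets.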
- Use step_upsample() from the themis library to upsample the “bad” category so that it is 50% of the “good” category. Do this by setting over_ratio = .5.
- Use step_downsample() from the themis library to downsample the “good” category so the bads and goods are even; set under_ratio = 1. Make sure to do this step AFTER step_upsample().
- Make the integer variables numeric (use step_mutate_at() with the all_numeric() helper, or this will be a lot of code). This step might seem really weird right now, but we’ll want to do this for the model interpretation we’ll do in a later assignment.

Once you have that, use prep(), juice(), and count() to count the number of observations in each class. They should be equal. This dataset will be used in building the model, but the data without up and down sampling will be used in evaluation.
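The arithmetic behind the two ratios can be worked through by hand. The sketch below uses made-up class counts (7000 “good”, 400 “bad”), not the real ones from lending_training, and it assumes the target counts are rounded down; themis may round slightly differently, so treat the results as approximate.

```r
# Hypothetical class counts, chosen only to make the arithmetic easy to follow
n_good <- 7000
n_bad <- 400

over_ratio <- 0.5   # minority-to-majority ratio targeted by upsampling
under_ratio <- 1    # majority-to-minority ratio targeted by downsampling

# Step 1: upsample "bad" until it is about over_ratio times the size of "good"
n_bad_up <- max(n_bad, floor(n_good * over_ratio))

# Step 2: downsample "good" until it is about under_ratio times the upsampled "bad"
n_good_down <- min(n_good, floor(n_bad_up * under_ratio))

c(bad = n_bad_up, good = n_good_down)
```

With these counts both classes end up at 3500 observations. Running the steps in the opposite order would first shrink “good” to 400 and leave nothing to upsample, which is one reason the order matters.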
set.seed(456)
lasso_recipe <- recipe(Class ~ ., data = lending_training) %>%
  # upsample "bad" to 50% of "good", then downsample "good" so the classes are even
  step_upsample(Class, over_ratio = 0.5) %>%
  step_downsample(Class, under_ratio = 1) %>%
  # convert integer columns to numeric
  step_mutate_at(all_numeric(), fn = ~as.numeric(.)) %>%
  # collapse sub_grade (e.g. "A1") to its letter grade
  step_mutate(sub_grade = as.character(sub_grade),
              grade = as.factor(str_sub(sub_grade, 1, 1))) %>%
  step_rm(sub_grade) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())
# count the observations in each class after up- and downsampling
lasso_recipe %>%
  prep(lending_training) %>%
  juice() %>%
  count(Class)
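The point that evaluation uses data without up and down sampling can be checked directly. The sketch below assumes the themis default of skip = TRUE for both sampling steps, so they apply when the recipe is prepped on the training data but are skipped when baking new data. It rebuilds the split so it runs on its own, and sampling_recipe, sampling_prep, train_counts, and baked_test are illustrative names.

```r
library(tidymodels)
library(themis)

data("lending_club")

set.seed(494)
lending_split <- initial_split(lending_club, strata = Class, prop = .75)
lending_training <- training(lending_split)
lending_test <- testing(lending_split)

# A recipe with only the two sampling steps, to isolate their effect
sampling_recipe <- recipe(Class ~ ., data = lending_training) %>%
  step_upsample(Class, over_ratio = 0.5) %>%
  step_downsample(Class, under_ratio = 1)

set.seed(456)
sampling_prep <- prep(sampling_recipe, training = lending_training)

# Training data as the model would see it: sampling has been applied
train_counts <- bake(sampling_prep, new_data = NULL) %>% count(Class)
train_counts

# Test data: the sampling steps are skipped, so no rows are added or dropped
baked_test <- bake(sampling_prep, new_data = lending_test)
nrow(baked_test) == nrow(lending_test)
```

Here bake(new_data = NULL) plays the same role as juice() in the code above.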